ABSTRACT

Gene ontology analysis has become a popular and
important tool in bioinformatics, and current
ontology analyses are mainly conducted on individual
genes or gene lists. However, recent molecular
network analysis reveals that the same list of genes 
with different interactions may perform different 
functions. Therefore, it is necessary to consider mo- 
lecular interactions to correctly and specifically 
annotate biological networks. Here, we propose a 
novel Network Ontology Analysis (NOA) method to 
perform gene ontology enrichment analysis on bio- 
logical networks. Specifically, NOA first defines 
link ontology that assigns functions to interactions 
based on the known annotations of joint genes 
via optimizing two novel indexes 'Coverage' and 
'Diversity'. Then, NOA generates two alternative ref- 
erence sets to statistically rank the enriched func- 
tional terms for a given biological network. We 
compare NOA with traditional enrichment analysis 
methods in several biological networks, and find 
that: (i) NOA can capture the change of functions 
not only in dynamic transcription regulatory net- 
works but also in rewiring protein interaction 
networks while the traditional methods cannot and 
(ii) NOA can find more relevant and specific func- 
tions than traditional methods in different types of 
static networks. Furthermore, a freely accessible 
web server for NOA has been developed at http:// 
www.aporc.org/noa/. 

INTRODUCTION 

The concept of biological function is fundamental to
genome research. The gradual accumulation of biological
knowledge inspired the Gene Ontology
(GO) project, which allows the annotation of tens of thousands
of genes in various species. As of 26 October 2010, there
were more than 2 753 338 annotations in 48 species
available in the GO database (1), which provide considerable
knowledge for biologists to understand the behavior of a
specific gene or gene product in a biological system.

These gene annotations provided by GO project 
describe the function of a single gene or gene product, 
but biologists are more interested in the GO enrichment 
analysis of a large gene list since widely applied 
high-throughput genomic, proteomic and bioinformatics 
scanning technologies, such as DNA microarray and 
protein mass spectrometry, usually result in a set of dif- 
ferentially expressed genes or proteins under studied bio- 
logical conditions; that is, the follow-up functional 
analysis of this large gene list becomes important in re- 
vealing biological meanings and allowing further experi- 
mental validation. To address this challenge, a number of 
GO functional enrichment tools have been developed. 
Recently, Huang et al. (2) comprehensively reviewed 68 
bioinformatics enrichment tools and classified them into 
three classes: singular enrichment analysis, gene set enrich- 
ment analysis and modular enrichment analysis. Khatri 
et al. (3) generally compared the limitations and charac- 
teristics of 14 tools in terms of scope of analysis, visual- 
ization capabilities, statistical model and correction for 
multiple comparisons, etc. Although each tool has 
distinct strengths (4-8), the common motivation behind 
these tools is to list the associated GO terms for the inter- 
esting gene list and then statistically identify the most 
enriched or significant biological annotations. 

However, an important lesson from network biology is 
that molecular interactions in addition to single molecules 
can be biologically meaningful (9). To be precise, genes 
carry out their specific functions through their temporal interactions
and may change function by interacting with different
neighbors (10). This implies that functional
analysis of a gene list (without considering interactions) is
still far from the 'optimal annotation'. Therefore, there is
a clear need to annotate functions by simultaneously con- 
sidering molecules and their interactions (11), i.e. to anno- 
tate biological function to biomolecular networks or 
biological networks (9,12). A biological network is defined 
as a set of nodes and links (edges). Usually, nodes repre- 
sent genes or their products and if two nodes have some 
type of interactions, there will be a link (an edge) between 
them. Currently, many biological networks have been ex- 
tensively studied, such as protein interaction networks 
(13), gene regulatory networks (14) and metabolic 
networks (15). In particular, some condition-specific sub- 
networks have been constructed to investigate fundamen- 
tally important biological problems, such as the disease- 
aging network (16), human liver metabolic network (17) 
and B-cell transcriptional regulatory networks (18). 

Furthermore, recent studies reveal that biological
networks are dynamic: they rewire in response to different
external conditions, and edges emerge or vanish
with temporal or spatial changes. An example of
transcriptional network dynamics, in the yeast transcriptional
regulatory networks under different conditions, is
given in ref. 19, and an example of protein-protein
interaction network dynamics, in tissue-specific protein
interaction networks, in ref. 20. These examples imply that
the same gene list with different interactions under
different conditions can have significantly different biological
meanings or functions. Thus, functional analysis of biological
networks (considering both genes and interactions)
can go beyond current functional enrichment
analysis tools for gene lists (considering only genes). To
show this, we present several examples in Figure 1. The 
first example is that two protein interaction networks, 
derived from the same set of proteins, may have different 
functions due to different modes of connection. As shown
in Figure 1(A1), a typical example from disease
research is the so-called 'edgetic' concept: many
human inherited disorders are caused not by a gene
removal (node removal) but by an edge removal (21).
The gene networks of patients and healthy people share the
same gene list, but their connections differ, and
therefore they have fundamentally different phenotypes.
In this situation, current gene list methods (GLMs) 
clearly cannot tell the difference because the edge information
is not considered. Another example comes from the transcriptional
regulatory process. As shown in Figure 1(A2),
TBL1 can be a repressor of RARB when forming a 
complex with GPS2, TBLR1, HDAC3 and NCoR, and 
it can also be an activator of pS2 cooperating with 
others (22). This suggests that the function of a gene 
depends on its interacting partners. Furthermore, many 
networks are shown to be dynamic. For example in 
Figure 1(A3), the regulatory network of yeast can be 
very different under different conditions (19). Taken
together, there is an urgent need to develop new analysis
methods that analyze the function of biological networks by
fully exploiting network topological information.

In this paper, we introduce a novel Network Ontology 
Analysis (NOA) method. Given a biological network, 
NOA first retrieves all available GO annotations of
individual genes from the GO database, and then assigns
GO terms to links between two genes through optimizing 
two indexes: 'Diversity' and 'Coverage'. Then two alter- 
native strategies, whole-net and sub-net, are applied to 
choose the reference set to statistically test which functions 
(GO terms) are significantly enriched. In Figure 1(B), we 
conceptually compare our method with the existing 
methods. We classify ontology analysis methods into 
three levels: individual gene, gene set and network. 
Individual gene annotation is based on the available 
gene information such as DNA sequence, protein struc- 
ture and associated phenotype to infer the functions of a 
single gene or a gene product. Software, such as Blast2GO 
(23) and GoAnnotator (24), helps to annotate genes at this 
level. GLMs conduct enrichment analysis in a gene set 
based on hypothesis testing. Tools such as FatiGO (4), 
DAVID (5), g:profiler (6) and BiNGO (7) belong to this 
category. Fundamentally different from the existing 
methods, our NOA is the first computational tool to 
focus on the functional analysis of links and networks. We
will show that NOA can find more relevant and specific 
enriched GO functions and, in particular, can capture the 
functional change with network rewiring. 



METHODS 

Link ontology analysis 

Links or biomolecular interactions are the building blocks 
of a biological network. To analyze the function of a 
network, the first step is to investigate the function of 
links in the network. As shown in Figure 2, gene 
ontology is illustrated in rectangles and represented as a 
directed acyclic graph. The annotation of each node in the 
network can be obtained from the existing GO annotation 
database. As a result, genes are annotated by black terms 
in the corresponding directed acyclic graph. Then our task 
at hand is to properly define the annotation of each link with
GO terms from its nodes.

Mathematically, a given biological network is represented
as $N = (V, E)$, where $V$ is the set of all genes and
$E$ is the set of all interactions. For each gene $g_m \in V$, we first retrieve
all relevant GO annotations of this gene, and propagate
these annotations upward through the GO term hierarchy,
i.e. any gene annotated to a certain term $t_k$
is also explicitly annotated by all the ancestors of $t_k$.
As a result, we have a term list $T_m$ representing
all known terms annotating gene $g_m$. Our task is to
determine the GO term assignments on all links, i.e.
$T(E) = \{T_{mn} \mid \forall m,n : e_{mn} \in E\}$, where $e_{mn}$ represents an
edge connecting $g_m$ and $g_n$ in $E$, and $T_{mn}$ represents the
set of all terms assigned to the edge $e_{mn}$. Intuitively, a reasonable
assignment $T_{mn}$ on $e_{mn}$ should be consistent with
the gene assignments $T_m$ and $T_n$ on $g_m$ and $g_n$. Given $T_{mn}$,
to quantitatively measure the functional inconsistency, we
define the 'Diversity' of $T_{mn}$ as

$$D(T_{mn}) = \frac{\sum_{t \in T_{mn}} I(t \notin T_m \cap T_n)}{|T_{mn}|} \qquad (1)$$
Figure 1. Schematic examples to illustrate the motivation of NOA. (A) Schematic examples to illustrate the motivation for considering interactions in
functional enrichment analysis. Here, we list three situations where gene list-based methods fail. (A1) illustrates the concept of 'edgetic' (21). Many
complex diseases are caused by edge removal instead of node removal from the wild type. The red node is an important protein related to some kind of
disease. Although the concentration of the protein does not change, a mutation breaks an interaction and leads to disease. This cannot be
detected by single-gene or gene list-based analysis. (A2) shows that TBL1 has fundamentally different functions when joining different transcriptional
complexes by acting as either co-repressor or co-activator (22). The blue line stands for DNA. (A3) is an example of network rewiring in yeast
transcriptional networks (19). We show the yeast transcriptional regulatory network on the left and the corresponding TF co-regulatory network on
the right. Interactions or genes are colored red if they are active in cell cycle, blue in sporulation, and black in both processes. (B) Three levels of
ontology analysis. Gene ontology is based on the information of sequence, structure, phenotype, etc. to infer the function of single genes or gene
products. Gene list ontology analysis performs enrichment analysis in a gene list based on hypothesis testing. Most tools such as FatiGO, DAVID,
g:profiler and BiNGO are at this level. Our NOA addresses the problem of network ontology analysis and conceptually belongs to the biological
network level.



where $I$ is an indicator function, i.e. $I$ equals one when
the corresponding event is true and zero otherwise; $|T_{mn}|$
represents the number of terms assigned to the edge $e_{mn}$.
We can conclude from the definition that $D(T_{mn})$ should
be small in an efficient assignment. Furthermore, the
'Diversity' of the assignment on the network (all links), i.e.
$T(E)$, is defined as the average $D(T_{mn})$:



$$D[T(E)] = \frac{\sum_{\forall m,n:\, e_{mn} \in E} D(T_{mn})}{|E|} \qquad (2)$$



Similarly, we define the 'Coverage' of $T(E)$, which is
the average $C(T_{mn_1}, T_{mn_2}, \ldots, T_{mn_k})$, where $n_1, \ldots, n_k$ are
the indexes of the $k$ genes connecting to gene $g_m$.
$C(T_{mn_1}, T_{mn_2}, \ldots, T_{mn_k})$ is the coverage ratio of all
functions on node $g_m$ covered by the functions of all
edges connecting to $g_m$. Particularly,

$$C[T(E)] = \frac{\sum_{\forall m:\, g_m \in V} C(T_{mn_1}, T_{mn_2}, \ldots, T_{mn_k})}{|V|} \qquad (3)$$

where $T_{mn_1}, T_{mn_2}, \ldots, T_{mn_k}$ represent the function assignments
of all edges connecting $g_m$, and

$$C(T_{mn_1}, T_{mn_2}, \ldots, T_{mn_k}) = \frac{\sum_{t \in T_m} I(t \in T_{mn_1} \cup T_{mn_2} \cup \cdots \cup T_{mn_k})}{|T_m|} \qquad (4)$$



From the definition of 'Coverage', an efficient assignment
should maximize 'Coverage'. Both
'Coverage' and 'Diversity' lie within [0,1].

The problem of link ontology analysis is thus one of
balancing 'Coverage' and 'Diversity'. It can be
easily proved that the simple GO term overlap strategy
$T_{mn} = T_m \cap T_n$ is in fact an optimal solution that
maximizes 'Coverage' subject to 'Diversity' being zero (see
Supplementary Text S1 for details). As shown in Figure 2,
$D(T_{mn}) = D(T_{np}) = 0$ and $C(T_{mn}) = C(T_{np}) = 1$, but
$C(T_{mn}, T_{np}) = 5/6$ since $t_6$ is not covered by the union of
$T_{mn}$ and $T_{np}$. Therefore, we have $D = 0$ and $C = 17/18$.
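As an illustration, the two indexes can be computed directly from their definitions. The following Python sketch is not the NOA implementation: the function names and the toy term sets are hypothetical, chosen only so that the result matches the Figure 2 numbers quoted above. It assumes that 'Diversity' counts the link terms not annotated to both endpoint genes, and that an empty term set contributes zero.

```python
from fractions import Fraction

def diversity(edges, gene_terms, link_terms):
    """Average, over edges, of the share of link terms not annotated to both endpoints."""
    total = Fraction(0)
    for m, n in edges:
        assigned = link_terms[(m, n)]
        if assigned:  # assumption: an empty assignment contributes zero
            shared = gene_terms[m] & gene_terms[n]
            total += Fraction(len(assigned - shared), len(assigned))
    return total / len(edges)

def coverage(edges, gene_terms, link_terms):
    """Average, over genes, of the share of gene terms covered by incident links."""
    total = Fraction(0)
    for gene, terms in gene_terms.items():
        covered = set()
        for edge in edges:
            if gene in edge:
                covered |= link_terms[edge]
        if terms:  # assumption: an unannotated gene contributes zero
            total += Fraction(len(terms & covered), len(terms))
    return total / len(gene_terms)

# Hypothetical term sets for three genes m, n, p (not read off Figure 2).
gene_terms = {
    "m": {"t1", "t2", "t3"},
    "n": {"t1", "t2", "t3", "t4", "t5", "t6"},
    "p": {"t3", "t4", "t5"},
}
edges = [("m", "n"), ("n", "p")]

# Overlap strategy: each link receives the terms shared by its two genes.
link_terms = {(a, b): gene_terms[a] & gene_terms[b] for a, b in edges}

print(diversity(edges, gene_terms, link_terms))  # 0
print(coverage(edges, gene_terms, link_terms))   # 17/18
```

Exact fractions are used so that the toy result can be compared with the value 17/18 without rounding.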
Figure 2. Schematic plot of the definition of link ontology. Gene ontology is structured as a directed acyclic graph, illustrated in a rectangle. The
annotation of each gene in the network is taken from the GO database. For example, gene $g_m$ is annotated by black terms such as $t_1$ and $t_2$ in tree $T_m$. Our task is to
define the functional annotation of interactions, e.g. $e_{mn}$, based on the annotations of genes. One simple way to annotate links is to calculate the
overlap of the GO term sets $T_m$ and $T_n$ of the interacting nodes $g_m$ and $g_n$, e.g. $T_{mn} = T_m \cap T_n$.



Hence, the straightforward approach of taking the overlap
of the GO term sets of the interacting nodes naturally
assigns functions to a link by maximizing 'Coverage'
while minimizing 'Diversity'. Biologically, this
strategy implies that two genes interact with each other
to perform the same biological function together. In this
paper, we use this simple strategy to define the link
ontology in a biological network. Next, we
define network ontology by regarding the network as a
set of links.
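The two steps of link ontology analysis, upward propagation of gene annotations and term overlap on each link, can be sketched as follows. This is a minimal illustration rather than the NOA code: the `parents` mapping (child term to parent terms), the function names and the toy terms and genes are all hypothetical.

```python
def ancestors(term, parents):
    """Return all ancestors of `term` in a DAG given as child -> list of parents."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def propagate(direct_annotations, parents):
    """Close each gene's term set under the ancestor relation."""
    closed = {}
    for gene, terms in direct_annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        closed[gene] = full
    return closed

def link_ontology(edges, gene_terms):
    """Overlap strategy: a link receives the terms shared by its two genes."""
    return {(m, n): gene_terms[m] & gene_terms[n] for m, n in edges}

# Hypothetical toy hierarchy: t1 is the root; t4 has two parents.
parents = {"t2": ["t1"], "t3": ["t1"], "t4": ["t2", "t3"], "t5": ["t3"]}
direct = {"g1": {"t4"}, "g2": {"t5"}}

gene_terms = propagate(direct, parents)
links = link_ontology([("g1", "g2")], gene_terms)
print(sorted(links[("g1", "g2")]))  # ['t1', 't3']
```

Because propagation is applied before the overlap, two genes annotated to different specific terms still share their common ancestors on the link.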

Network ontology analysis 

With the above definition for link ontology, we treat the 
biological network as a set of links. Then network 
ontology analysis is essentially a statistical test to assess 
the enrichment of GO terms in this set of links. The calculation
procedure is shown in Figure 3. Two networks are involved.
One is the input test network to be annotated,
whose links we collect as the test set. The
other is the reference network, which serves as the control for the statistical
test and whose links we collect as the reference set. Given a GO
term $t_k$, we count the number of occurrences of $t_k$ in the test
and reference sets, respectively. A Venn diagram shows the
relationship between the frequency of this GO term in the
reference set and that in the test set. From the diagram,
we infer whether or not the GO term $t_k$ is enriched in
the test set. Several statistical models can test this,
including, but not limited to, the hypergeometric test, Fisher's
exact test, the binomial test and the $\chi^2$ test. Here, we introduce one of the
most popular and powerful methods, the hypergeometric test.
We suppose there are $T$ links in the test set. Also there are
Figure 3. Illustration of the network ontology analysis by statistically
testing the function enrichment. A simple Venn diagram is drawn for the
statistical test of network ontology analysis. The test set is all links
in the input network. The reference set is all possible links among
genes in the test network for the whole-net method by default, or a given
background network for the sub-net method. Given a GO term, the null
hypothesis of the test is that genes with this GO term have the same
probability to fall in the reference set and in the test set. R denotes the
number of elements in the reference set; G means the number of
elements annotated by the given GO term in the reference set; T indicates
the number of elements in the test set; O denotes the number of
elements annotated by the given GO term in the test set.
$R$ links in the reference set, of which $G$ are
annotated by term $t_k$. Here the null hypothesis is that
links with annotation $t_k$ have the same probability to fall
in the reference set and in the test set. We then treat the
overlapping number of links $X$ as a random variable.
Under the null hypothesis, $X$ follows a hypergeometric
distribution. Then we can calculate a P-value,
which is defined as the probability that the overlapping
number would assume a value greater than or equal to the
observed value, $O$, by chance:



$$P(X \ge O) = \sum_{k=O}^{\min(G,T)} \frac{\binom{G}{k}\binom{R-G}{T-k}}{\binom{R}{T}} \qquad (5)$$



The overlapping number is statistically significant if the
P-value is smaller than a chosen cutoff. This process
is applied to each GO term to pick out the significant ones.
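For concreteness, the upper-tail hypergeometric probability can be evaluated with exact binomial coefficients. The sketch below is only an illustration under the notation of Figure 3 (R, G, T, O); the function name and the toy counts are hypothetical, and it is not taken from the NOA web server.

```python
from math import comb

def hypergeom_pvalue(R, G, T, O):
    """P(X >= O) when T links are drawn from R reference links, G of which carry the term."""
    upper = min(G, T)
    # Sum the hypergeometric probability mass from the observed overlap upward.
    tail = sum(comb(G, k) * comb(R - G, T - k) for k in range(O, upper + 1))
    return tail / comb(R, T)

# Toy counts: 10 reference links, 4 annotated, 3 test links, 2 of them annotated.
p = hypergeom_pvalue(R=10, G=4, T=3, O=2)
print(p)  # (6*6 + 4*1) / 120, i.e. about 0.333
```

Integer arithmetic is kept until the final division, which avoids the loss of precision that products of small floating-point probabilities would cause for large link sets.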

The choice of the reference set is important in the statistical
test. We provide two alternative methods in our
implementation: the whole-net and sub-net methods. In the
whole-net method, the reference set is chosen as all
possible links in the test network, while in the sub-net
method, it is chosen as all links in a pre-given background
network. Therefore, we can perform two types
of NOA, i.e. whole-net NOA and sub-net NOA. To
correct for multiple hypothesis testing (25), we
use a widely used method, the Bonferroni
correction.
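The two reference-set choices and the multiple-testing correction can be sketched as below. The sketch assumes undirected links without self-loops, so that the whole-net reference set is the clique on the test-network nodes; all function names and the toy inputs are hypothetical.

```python
from itertools import combinations

def whole_net_reference(test_edges):
    """All possible links among the nodes of the test network (a clique)."""
    nodes = sorted({gene for edge in test_edges for gene in edge})
    return list(combinations(nodes, 2))

def sub_net_reference(background_edges):
    """All links of a pre-given background network, as sorted unique pairs."""
    return sorted({tuple(sorted(edge)) for edge in background_edges})

def bonferroni(pvalues):
    """Multiply each P-value by the number of tested terms, capped at 1."""
    n = len(pvalues)
    return {term: min(1.0, p * n) for term, p in pvalues.items()}

test_edges = [("a", "b"), ("b", "c")]
print(whole_net_reference(test_edges))
# [('a', 'b'), ('a', 'c'), ('b', 'c')]

print(sub_net_reference([("b", "a"), ("a", "b"), ("c", "d")]))
# [('a', 'b'), ('c', 'd')]

print(bonferroni({"term_x": 0.01, "term_y": 0.6}))
# {'term_x': 0.02, 'term_y': 1.0}
```

Under the whole-net choice no background network is needed, whereas the sub-net choice restricts the comparison to interactions that are known to be possible.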



RESULTS 

NOA captures functions in response to network dynamics 

One of the important advantages of NOA is that it can 
monitor the link dynamics in networks. More and more 
evidence shows that the same set of genes may form dif- 
ferent networks in response to temporal and spatial con- 
ditions (19-21,26,27). In this case, the traditional gene 
list-based functional enrichment analysis always reports 
the same result when the networks have the same node 
set but rewired structures. In contrast, NOA can detect 
such function changes caused by network structure 
change, and further capture the functional differences by 
fully taking advantage of the topology information of 
networks. In this section, we will introduce two applica- 
tions to illustrate the advantage of NOA. 

Example 1: dynamic transcription factor cooperation 
networks. Recently, Luscombe et al. (19) developed a 
method to uncover the condition-specific transcription
regulatory network by integrating transcriptional regula- 
tory information and gene expression data in yeast. 
Particularly, they first constructed a static background 
network which contains 7074 regulatory interactions 
among 142 transcription factors and 3420 target genes 
by assembling known regulatory interactions from the 
results of genetic, biochemical and chromatin immunoprecipitation
(ChIP)-chip experiments, and then integrated
gene expression data of five conditions including cell
cycle, sporulation, diauxic shift, DNA damage and stress 
response to reconstruct regulatory networks in each con- 
dition. As shown in Figure 1(A3), there are large changes 
of the regulatory network architecture in cell cycle and 
sporulation processes of Saccharomyces cerevisiae. Their 
results provide strong evidence that most gene functions 
arise in response to changing conditions and the rewired 
network structures. 

Here, we study whether or not the change of networks 
can be revealed by gene ontology enrichment analysis. Not 
surprisingly, both GLM and NOA can capture the differ- 
ence between the two types of biological processes (cell 
cycle and sporulation) because significantly expressed 
genes are different in the two stages [refer to left figure 
of Figure 1(A3)]. In particular, we use NOA and GLM to
test whether the GO term 'cell-cycle process' is enriched in the
cell-cycle regulatory network compared with the background
network. The P-value is 3.6e-27 for NOA and
2.4e-23 for GLM. Similarly, the P-values for the GO term 'sporulation'
in the sporulation transcription regulatory network are
1.3e-14 for NOA and 3.8e-20 for GLM. Both methods
work well, because the main differences between the two
networks are basically in nodes. 

However, our question is whether we can judge the
stage of a cell with relatively incomplete information,
e.g. without the information of target genes. We further
construct transcription factor (TF) co-regulatory networks 
(28) via adding an edge between two TFs if they have 
at least one common target gene. This process is carried 
out in cell cycle, sporulation and background tran- 
scription regulatory networks, and correspondingly 
results in the three TF co-regulatory networks, i.e. cell 
cycle co-regulatory network [Figure 4(A1)], sporulation 
co-regulatory network [Figure 4(B1)], and background 
co-regulatory network [right figure of Figure 1(A3)]. As 
shown in Figure 4(A1) and (B1), the cell-cycle TF co-regulatory
network contains 67 TFs and 319 co-regulations,
while the sporulation TF co-regulatory network contains
70 TFs and 302 cooperations (refer to Supplementary
Table S1 for details). Most of the nodes in the two
networks are the same (black nodes). Given this, we 
compare results of four methods, i.e. whole-net NOA, 
sub-net NOA, whole-net GLM and sub-net GLM, in TF 
co-regulatory networks in response to both cell cycle and 
sporulation. As shown in Table 1, the four methods are 
different in terms of the choice of test set and reference set. 
Here, the test set is chosen as all links in NOA, and all
genes with links in GLM. Sub-net means choosing the background
network (union of all possible co-regulatory
networks) as the reference set in NOA, and choosing all
TFs in the background network in GLM. Whole-net
means choosing the clique (a link between every two
nodes) as the reference set in NOA, and choosing all yeast
genes in GLM. The comparison results are shown in
Figure 4 (refer to Supplementary Table S2 for detailed
results). Figure 4(A2) shows the rank of all related terms
by the different methods in the cell-cycle co-regulatory network.
Pink bars stand for significant terms with P-value less
than 0.05, and the red horizontal bar shows the position of
GO:0022402 (cell-cycle process). We find that
Figure 4. Applications of NOA on yeast TF co-regulatory networks. (A) NOA results on the yeast TF co-regulatory network in response to the
cell-cycle condition. (A1) illustrates the TF co-regulatory network. We construct TF co-regulatory networks by defining a TF co-regulation
relationship if two TFs regulate at least one common target. (A2, A3) show the comparison between NOA and gene list methods (GLM). (A2)
presents the rank of all related GO terms in the four methods. The pink part represents significant terms with P-value less than 0.05, and the position of
GO:0022402 (cell-cycle process) is shown by a red horizontal bar. (A3) shows P-values of GO:0022402 reported by the four methods. (B) NOA
results on the yeast TF co-regulatory network in response to the sporulation condition. (B2) The position of GO:0043934 (sporulation) is shown by a
blue horizontal bar. (B3) shows P-values of GO:0043934 reported by the four methods. The red dashed line is the baseline of -log(0.05).



whole-net NOA, whole-net GLM and sub-net NOA
report this term as significant, but sub-net GLM
fails to. The corresponding P-values of this term are shown in
Figure 4(A3). Additionally, whole-net NOA ranks
GO:0022402 in the top 5%, which is much better than
the top 20% of whole-net GLM, although both of them report
GO:0022402 as significant. Similarly, Figure 4(B2) and
(B3) show a clear tendency for NOA methods
to be better than GLM methods at identifying biologically
reasonable functions of rewiring regulatory networks. To
further demonstrate the effectiveness of NOA on rewiring networks,
we compare NOA and GLM on a rewiring protein interaction
network as follows.



Table 1. Test set and reference set of the four types of GO analysis
methods: whole-net NOA, sub-net NOA, whole-net gene list method
and sub-net gene list method

Method  Set            Whole-net    Sub-net
NOA     Test set       Link list    Link list
NOA     Reference set  Clique       Background network
GLM     Test set       Gene list    Gene list
GLM     Reference set  Yeast gene   Gene in background network



NOA denotes the method we proposed in this article and GLM means 
gene list-based method. 
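The construction of a TF co-regulatory network used in Example 1, where two TFs are linked if they share at least one target gene, can be sketched as follows. The function name and the TF and gene labels are hypothetical placeholders, not data from ref. 19.

```python
from itertools import combinations

def coregulatory_edges(tf_targets):
    """Link two TFs whenever their target sets share at least one gene."""
    edges = set()
    for tf_a, tf_b in combinations(sorted(tf_targets), 2):
        if tf_targets[tf_a] & tf_targets[tf_b]:
            edges.add((tf_a, tf_b))
    return edges

# Hypothetical regulatory interactions: TF -> set of target genes.
tf_targets = {
    "TF_A": {"g1", "g2"},
    "TF_B": {"g2", "g3"},
    "TF_C": {"g4"},
}
print(coregulatory_edges(tf_targets))  # {('TF_A', 'TF_B')}
```

The resulting edge set is the link list used as the NOA test set in Table 1, while the TFs appearing in it form the gene list used by GLM.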



Example 2: rewiring protein interaction networks. We 
identified the rewired protein interaction networks during 
the progression of Alzheimer's disease (AD), which is a
complex genetic disorder of the nervous system affecting
millions of elderly individuals worldwide (29). Clinically,
AD is categorized into three stages: incipient, moderate
and severe. More and more evidence indicates
that the three stages have different features at the molecular
level (30,31). In our previous work, we identified
the different protein interaction networks in the three
development stages via an edge-expansion scheme that
combines protein interaction and microarray data (26).
Traditional gene list-based methods can give enriched GO
terms, such as regulation of transcription, DNA-dependent
(refer to Table 2), which are, however, identical
across the three stages, i.e. they cannot distinguish the
dysfunctional differences among the three stages. By comparison,
we use whole-net NOA to analyze the three
networks respectively. The results show different enriched
biological processes for the protein interaction networks
in the three stages. For instance, in the incipient
stage, the protein interactions are annotated to perform
the processes of vesicle-mediated transport and regulation
of phosphorylation, etc., which implies AD dysfunctional
progression of peptide cleavage and deposition (32).
In the moderate stage, regulation of kinase activity becomes the most enriched
GO function, which indicates the importance of regulation
of phosphorylation in neurons during AD development (33).
Sterol transport, apoptosis and proteolysis
are identified as the top-three ranked terms for the
protein network in the severe stage. This provides evidence
for neuron cell death and protein degradation in the
severe stage of AD (30). Collectively, NOA lets us
monitor the functional change across disease stages
and thereby outperforms GLM.
Table 2. The functional characterization of protein interaction
networks during Alzheimer's disease progression revealed by NOA
and GLM

Network type  Method  GO term (BP)  Description
Incipient     NOA     GO:0016192    Vesicle-mediated transport
Incipient     NOA     GO:0042325    Regulation of phosphorylation
Incipient     NOA     GO:0005979    Regulation of glycogen biosynthetic process
Incipient     GLM     GO:0006355    Regulation of transcription, DNA dependent
Incipient     GLM     GO:0045944    Positive regulation of transcription from RNA polymerase II promoter
Incipient     GLM     GO:0007242    Intracellular signaling cascade
Moderate      NOA     GO:0043549    Regulation of kinase activity
Moderate      NOA     GO:0048589    Developmental growth
Moderate      NOA     GO:0006897    Endocytosis
Moderate      GLM     GO:0006916    Anti-apoptosis
Moderate      GLM     GO:0007165    Signal transduction
Moderate      GLM     GO:0006355    Regulation of transcription, DNA dependent
Severe        NOA     GO:0015918    Sterol transport
Severe        NOA     GO:0006915    Apoptosis
Severe        NOA     GO:0006509    Membrane protein ectodomain proteolysis
Severe        GLM     GO:0006355    Regulation of transcription, DNA dependent
Severe        GLM     GO:0006629    Lipid metabolic process
Severe        GLM     GO:0045944    Positive regulation of transcription from RNA polymerase II promoter


We manually choose the top three non-duplicated terms in the results of
these two methods. Here, AD means Alzheimer's disease and BP means
biological process.



NOA identifies specific functions 

In addition to capturing the functional difference due to
network rewiring, NOA can also be used in traditional 
static networks to find more specific GO annotations. 
The rationale is that NOA considers the interactions 
among the genes to allow the biological interpretation to 
be focused at the 'biological network' level. In this section, 
we will introduce two applications to demonstrate such an 
advantage of NOA. 

Example 1: KEGG pathway. The first example is the com- 
parison of NOA and GLM in Kyoto Encyclopedia of 
Genes and Genomes (KEGG) pathway (34,35). KEGG 
aims to uncover higher-order systemic behaviors of the 
cell by collecting reliable pathways, which are a valuable
resource for assessing NOA because the functions of the
pathways have been well studied (34,35). 

As a proof-of-concept example, we focus on a specific 
pathway hsa05212, which is related to pancreatic cancer in 
Homo sapiens, and consists of 33 interactions. It is well 
known that tumor-related genes are important and tend to 
have many functions (36). So, we use this example to show 
that NOA can capture specific functions of the cancer
by considering links among genes. Since one interactor
in a KEGG pathway may consist of multiple genes (in
total there are 70 genes involved), we define the functions
of an interactor as the union of the functions of all related genes,
and then apply whole-net NOA to analyze the function
of the pathway. For comparison, we use g:profiler (6) to
annotate the genes involved in this pathway. The top 20
significant biological process terms of the two
approaches are extracted and listed in Supplementary
Table S3. NOA captures the main features of cancer, namely regulation
of signal transduction, regulation of signaling
process and anti-apoptosis (37), while g:profiler annotates
these cancer genes by terms such as intracellular signaling 
cascade, positive regulation of cellular process and 
signaling. 

To quantitatively show the difference between the results
generated by the two methods, we define the 'specificity' of
each GO term as the distance between the given term and
the top term (biological process) in the GO hierarchy, i.e.
the level at which the term is located in the GO directed acyclic
graph. As shown in Figure 5(A), NOA clearly
identifies terms at a much deeper level than
GLMs (P-value = 0.0028 by rank-sum test). To visualize
the comparison, we pick out the
top five significant GO terms by NOA and by GLM, respectively,
and compare them side by side in a subgraph of the GO directed acyclic
structure (refer to Figure 6). Specifically, we first retrieve all
ancestors of the 10 terms according to the GO structure, and
add relationships among these terms as directed edges.
Then we highlight the top five terms of NOA, the top five
terms of GLM, the top 20 terms of NOA (without the top five)
and the top 20 terms of GLM (without the top five) in dark
yellow, dark green, buff and light green, respectively.
Figure 6 clearly shows that NOA tends to give more
specific annotations than GLM. For example, tumors are
related to apoptosis. GLM ranks the term 'regulation of
apoptosis' in its top 20, but it cannot tell whether the
pathway promotes or represses apoptosis.
NOA is more specific, ranking 'anti-apoptosis'
in its top 5.
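The 'specificity' of a term can be sketched as a graph distance to the top term. Since a term in a directed acyclic graph may reach the top term along several paths, the sketch below assumes the shortest path is meant; this choice, the function name and the toy hierarchy are assumptions made for illustration only.

```python
from collections import deque

def term_depth(term, parents, root):
    """Shortest number of parent steps from `term` up to `root`; None if unreachable."""
    queue = deque([(term, 0)])
    seen = {term}
    while queue:
        current, dist = queue.popleft()
        if current == root:
            return dist
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, dist + 1))
    return None

# Hypothetical hierarchy: t3 reaches the root through t2 -> t1 or directly through t1.
parents = {"t1": ["root"], "t2": ["t1"], "t3": ["t2", "t1"]}

for term in ("root", "t1", "t2", "t3"):
    print(term, term_depth(term, parents, "root"))
# root 0, t1 1, t2 2, t3 2
```

Using the longest path instead would rank t3 deeper than t2, so the convention should be fixed before specificities of two methods are compared.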

Furthermore, to check whether NOA also outperforms
GLM on other KEGG pathways, we apply the two methods
to all human KEGG pathways to evaluate the methods
statistically. KEGG currently contains 226 human
pathways in total, of which 91
contain more than 30 interactions. Both NOA and
GLM are applied to these 91 pathways to rank related
GO terms. The top 10 terms reported by NOA for each of
these pathways are extracted and compared in specificity
with those reported by GLM. Supplementary Figure S1 shows
that the NOA results have a significantly higher specificity
than those of GLM (P-value < 2.7e-6 by
Wilcoxon rank-sum test). This large-scale study strongly
supports our conclusion that NOA outperforms GLM
by revealing more specific functions for biological
networks.
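
As a rough illustration of how two sets of specificity levels can be compared with a rank-sum test, the sketch below implements a two-sided Wilcoxon rank-sum test using the normal approximation, with average ranks for ties and no tie correction of the variance. This is not necessarily how the reported P-values were computed, and the two level lists are invented for the example.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Ties receive average ranks; the variance is NOT tie-corrected.
    Returns (z, p) where z > 0 means x tends to have larger values.
    """
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)  # rank sum of x
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided P-value
    return z, p


# Invented GO levels for two hypothetical result lists (not data from the paper).
levels_a = [6, 7, 5, 8, 7, 6, 9, 7, 6, 8]
levels_b = [3, 4, 2, 5, 4, 3, 4, 5, 3, 4]
z, p = rank_sum_test(levels_a, levels_b)
print(f"z = {z:.2f}, p = {p:.4g}")
```

For small samples or heavily tied data, an exact test or a tie-corrected variance would be the more careful choice.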

Example 2: aging network. Another example is the analysis
of the aging network (refer to Supplementary Table S4). We
assemble the aging network by identifying the genes
related to aging and then adding a link if two genes interact
with each other (16). In our previous work, we have shown
that aging networks have a close relationship with disease
networks (16). Here, we perform ontology enrichment
analysis with NOA and traditional GLM, respectively, and
compare them. We find that the two methods give different
rankings of GO terms (detailed results can be found in
Supplementary Table S5). For instance, GLM prioritizes
cell death, while NOA prioritizes metabolic process, which
is biologically more reasonable. Furthermore, we compare
the specificity of the function annotations revealed by the
two methods. As shown in Figure 5(B), GO terms identified
by NOA are on average more specific than those identified
by GLM, i.e. these terms are at much deeper levels in the
GO hierarchy (P-value = 0.0051 by rank-sum test).

Figure 5. Box plot comparing the specificity of the functional annotations revealed by NOA and the gene list method. (A) Top 20 functional terms
revealed by NOA and the gene list method for the pancreatic cancer pathway. (B) Top 20 functional terms identified by NOA and the gene list method for the aging
network. The y-axis shows the distance from a given term to the top term in the GO structure, i.e. the level at which the term is located, which indicates the
specificity of the functional term. Terms at deeper levels are considered more specific.

Figure 6. Comparison of the results of NOA and GLM for the pancreatic cancer pathway. Specifically, we pick out the top five GO terms revealed by NOA
and GLM [g:profiler (6)] and compare them side by side in the GO directed acyclic structure. The top five terms of NOA are highlighted in dark
yellow, while the top five of GLM are colored dark green. In addition, terms colored buff are within the top 20 of the NOA results, and light green ones are
within the top 20 of GLM. The results show that NOA identifies more specific annotations at deeper levels of the GO hierarchy.

Web server for NOA 

Based on the above results, we believe that NOA is a
potentially powerful tool to study the condition-specific
functions of subnetworks and capture the function dynamics
caused by network rewiring. Given the rapid advances in
network biology, there is a pressing need for network
ontology analysis. Thus, we implement NOA as a freely
accessible web server, which is a collection of tools for
whole-net NOA, sub-net NOA, whole-net GLM and sub-net GLM.
For whole-net methods, users can input either a gene list
or a gene network (i.e. a link list) by pasting it in the text
box or uploading a data file from their local disk, and the
web server will then return the resulting ranking of GO terms.
For sub-net methods, by contrast, a reference gene list or
reference network is also necessary in addition to the
test gene list or test gene network.

It is worth mentioning that the reference set is required to
contain the test set so that Equation (5) is valid and the
results are biologically meaningful. The default reference set
is the fully connected network. Two parameters, the species
and the P-value cutoff, should be specified by users according
to their own needs. Currently, NOA supports four
species: H. sapiens, Mus musculus, Rattus
norvegicus and Saccharomyces cerevisiae.
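
The containment requirement and the default reference set described above can be sketched as follows. This is only an illustration under stated assumptions: the input format (one whitespace-separated gene pair per line) and the function names `parse_links`, `full_reference` and `check_reference` are hypothetical and do not describe the web server's actual interface. The example links are for illustration only.

```python
from itertools import combinations


def parse_links(text):
    """Parse one 'geneA geneB' pair per line into a set of undirected links.

    Assumption: whitespace-separated pairs; the server's real format may differ.
    """
    links = set()
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        a, b = parts
        if a != b:  # ignore self-links
            links.add(frozenset((a, b)))
    return links


def full_reference(links):
    """Default reference: every possible link among the nodes of the test network."""
    nodes = sorted({gene for link in links for gene in link})
    return {frozenset(pair) for pair in combinations(nodes, 2)}


def check_reference(test_links, reference_links):
    """Raise ValueError if some test link is missing from the reference set."""
    missing = test_links - reference_links
    if missing:
        raise ValueError(f"{len(missing)} test link(s) not in the reference set")


test_net = parse_links("KRAS RAF1\nRAF1 MAP2K1\nTP53 CDKN1A")
reference = full_reference(test_net)
check_reference(test_net, reference)  # passes: the reference contains the test set
print(len(test_net), len(reference))
```

With n nodes the fully connected reference holds n(n-1)/2 links, so the three example links over five genes give a reference of ten links.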

As shown in Figure 7, the output of NOA is a ranked
list of GO terms for biological processes (BP), cellular
components (CC) and molecular functions (MF), together
with the corresponding values of G, R, T and O in Formula (1),
P-values, corrected P-values and related genes or
links. The top 10 GO terms are highlighted in the resulting
table. In addition, the ranking of the significant GO terms
can be downloaded via a hyperlink provided on the web
page.



DISCUSSION 

In this paper, we propose a novel function annotation
tool for biological networks, which is able to provide
specific function annotations for the corresponding
biological system. One of the main contributions of our
new method is to alleviate the nonspecificity problem
caused by the redundant nature of functional annotations.
Usually when we obtain a large 'interesting' gene list by
high-throughput techniques, the real biological insights
are hidden in a large number of general, redundant
and nonspecific GO function annotations. We note that
there have been many efforts to deal with this problem. For
example, the newly developed Functional Annotation
Clustering of DAVID (5) groups similar annotations
together to reduce the redundancy. Here, NOA
adopts a very different strategy by highlighting the
interactions among the genes (edges) for a given large gene
list. We believe that the interactions among genes are
biologically meaningful and make the biological insights
clearer and more focused in a specific condition.

Figure 7. The interface of the web server for NOA. NOA is designed as a
web tool to provide a public service for network ontology analysis. Users
can input networks either by directly pasting a link list or by uploading files.
The web server outputs all significant GO terms for biological
processes (BP), cellular components (CC) and molecular functions
(MF).

NOA is also helpful for revealing more specific function
annotations. In many cases, a single gene is annotated
with multiple functions. There is plenty of evidence
that interactions play important biological roles in further
distinguishing the functions of single genes.
For example, Cmd1 in ref. 38 is a date hub that connects
with four modules (homeostasis of other cations; cell
polarity and filament formation; endoplasmic reticulum;
and protein folding and stabilization) in four different
conditions. We cannot know the precise biological function
of Cmd1 if we only examine the individual gene. To overcome
this difficulty, NOA infers specific functions by considering
the neighbor genes interacting with Cmd1 in different
conditions. Another example is that a gene or protein
may take several part-time jobs. For instance, eIF3f is an
important housekeeping gene and is necessary for the
initiation of translation. A recent study shows that eIF3f
also has a second role, acting positively on Notch signal
transduction by interacting with other genes (39). NOA can
recognize its correct function by examining its neighbor
molecules in different working environments.

Link ontology is central to our network ontology
analysis. Therefore, it is crucial to define the functions
of links well. In fact, a similar concept, 'edge
ontology' or 'arrow ontology', has been suggested by a
forward-looking work (40). Inspired by the gene ontology,
Lu et al. aim to build a similar hierarchical term structure
for edges. In their prototype of edge ontology, edges are
partitioned in four levels: direction, type, sub-type and
specification. The complete edge ontology will provide a
relatively explicit representation of the connections among
genes in addition to revealing relationships among edges.
However, edge ontology is still far from complete enough to
describe the functional relationships in a network. In
contrast, we note that gene ontology contains 32 862
terms and 2 753 338 annotations to date. Therefore,
NOA takes a different strategy and defines the functions
of edges based on the existing rich GO terms instead of
making a fresh start. To distinguish it from the previous
edge ontology definition, we name the ontology defined in
this paper 'link ontology'. Here, we simply take the
overlap of the GO term sets of the two nodes to define the
GO annotation of a link. This strategy is simple, easy to
implement and accurate. The possible disadvantage is that
the 'Coverage' may be low. In fact, we can also build a
general integer programming model to define link ontology
by optimizing both Diversity (D) and Coverage (C) and
considering the GO hierarchical structure (41,42) (refer to
Supplementary Text S2 for details). Importantly, this model
can integrate more information to predict link ontology with
a larger 'Coverage' without a significant increase in
computational cost. Given that the annotations of gene
function are far from complete (43), NOA is an important step
toward annotating functions on a biological system,
since it offers a novel way to infer edge function
in addition to gene function.
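
The overlap strategy described above (a link inherits the GO terms shared by its two genes) can be sketched in a few lines. The gene names and annotations below are hypothetical, and the printed fraction of annotated links is only a rough stand-in for intuition; it is not claimed to be the paper's 'Coverage' index, whose definition lies outside this section.

```python
def link_ontology(links, gene_terms):
    """Annotate each link with the GO terms shared by its two genes."""
    return {
        (a, b): gene_terms.get(a, set()) & gene_terms.get(b, set())
        for a, b in links
    }


# Hypothetical gene annotations, for illustration only.
GENE_TERMS = {
    "g1": {"anti-apoptosis", "signaling"},
    "g2": {"anti-apoptosis", "cell cycle"},
    "g3": {"metabolic process"},
}
LINKS = [("g1", "g2"), ("g2", "g3")]

annotated = link_ontology(LINKS, GENE_TERMS)
# Fraction of links that receive at least one term (illustrative proxy only).
share = sum(1 for terms in annotated.values() if terms) / len(annotated)
print(annotated)
print(share)
```

The second link receives no term because its two genes share no annotation, which is the situation the text refers to when it notes that the 'Coverage' of the simple overlap strategy may be low.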

The choice of reference set allows NOA to report
specific significant terms at different levels according to
the users' needs. Choosing all possible links among the given
nodes as the background avoids possible bias. For
example, some genes such as P53 and c-Myc are very
important, so many studies focus on these kinds of genes
(44). Accordingly, many functions are annotated on these
genes, whereas the functions of other genes are
barely characterized. To our knowledge, some methods,
such as BiNGO, try to reduce this bias by choosing an
appropriate subnetwork. The choice of a reference set is
still an open problem in functional enrichment
analysis. Currently, the computational complexity of NOA is
O(n^2), where n is the number of genes in the input
network. There is still room for further improvement
by sampling random networks, which seems more
reasonable since the random process has no bias.

In our paper, we showed that NOA is helpful for capturing
the function change caused by network rewiring. Here, network
rewiring means a change in the existence of links.
However, in many cases, biological networks change in
the weights of links rather than in their existence, so it is
necessary to further extend NOA to handle weighted
networks. An intuitive idea is to enhance the role of
links with larger weights by duplicating the GO terms
annotated on them. As a result, the numbers in
Equation (5) can be recounted accordingly for the
statistical test. In addition to weighted networks, we note
that directed networks are also important in many biological
systems. In the current NOA, we handle a directed network
by treating it as its corresponding undirected network.
This does not fully utilize the edge information, and we will
introduce a more precise model to functionally annotate
directed networks in our future work.
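
The weighting idea mentioned above (duplicating the GO terms on links with larger weights) might look like the following sketch. It is one possible reading only: rounding each weight to a positive integer number of copies is an assumption made here, and the function name, link names and weights are hypothetical.

```python
from collections import Counter


def weighted_term_counts(link_terms, link_weights):
    """Count each GO term once per unit of (rounded) weight on its links.

    Assumption: a weight w is turned into max(1, round(w)) copies of the link;
    links without a weight count once.
    """
    term_counts = Counter()
    total_copies = 0
    for link, terms in link_terms.items():
        copies = max(1, round(link_weights.get(link, 1)))
        total_copies += copies
        for term in terms:
            term_counts[term] += copies
    return term_counts, total_copies


# Hypothetical link annotations and integer weights, for illustration only.
LINK_TERMS = {
    ("g1", "g2"): {"anti-apoptosis"},
    ("g2", "g3"): {"anti-apoptosis", "signaling"},
}
LINK_WEIGHTS = {("g1", "g2"): 3, ("g2", "g3"): 1}

counts, total = weighted_term_counts(LINK_TERMS, LINK_WEIGHTS)
print(dict(counts), total)
```

The recounted term frequencies and the recounted total number of links would then replace the unweighted counts in the enrichment test.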



Another direction of improvement is to take more account
of the relationships or correlations among GO terms.
This is important because relationships among GO terms
are represented by an acyclic digraph, and simply
propagating annotations upward through the GO
term hierarchy or treating the GO terms independently
in the statistical test will lose certain information. If more
information can be added to the analysis process, the results
will be more meaningful. Besides, the concept of network
ontology on edges can be extended to a supergraph whose
edges may consist of more than two nodes. If we
consider a module as a basic element that carries out
functions, it is plausible to define 'module ontology'
rather than 'link ontology' for supergraph ontology
analysis. Lastly, in our model, we consider a network as
a collection of links. It may also be meaningful to consider
nodes and edges at the same time. In summary, there is still
much room to extend the current network ontology
analysis framework.

CONCLUSIONS 

We proposed a novel GO functional enrichment analysis
method for biological network analysis. Our method
differs from the traditional methods by considering
the additional biological significance of molecular
interactions. First, we proposed a novel scheme to infer link
ontology from gene ontology by optimizing two indexes,
'Diversity' and 'Coverage'. Based on the link ontology, we
gave two alternative approaches to implement network
ontology analysis, i.e. whole-net and sub-net NOA. To
demonstrate the effectiveness of NOA, we applied it to several
real biological networks. The results show that NOA can
reveal more reasonable biological meaning than GLM in
both dynamic networks and static networks. Furthermore,
we developed a freely accessible web server for NOA,
which allows network ontology analysis online and can
help researchers to identify specific and efficient GO
terms in practical use.

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online.